Install and load the janitor package. Use its
clean_names() function on the Pokémon data, and save the
results to work with for the rest of the assignment. What happened to
the data? Why do you think clean_names() is useful?
pokemon <- read.csv("data/Pokemon.csv") %>% clean_names()
pokemon %>%
head(30) %>%
kable() %>%
kable_styling(full_width = F) %>%
scroll_box(width = "100%", height = "200px")
| x | name | type_1 | type_2 | total | hp | attack | defense | sp_atk | sp_def | speed | generation | legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | Charmander | Fire | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False | |
| 5 | Charmeleon | Fire | 405 | 58 | 64 | 58 | 80 | 65 | 80 | 1 | False | |
| 6 | Charizard | Fire | Flying | 534 | 78 | 84 | 78 | 109 | 85 | 100 | 1 | False |
| 6 | CharizardMega Charizard X | Fire | Dragon | 634 | 78 | 130 | 111 | 130 | 85 | 100 | 1 | False |
| 6 | CharizardMega Charizard Y | Fire | Flying | 634 | 78 | 104 | 78 | 159 | 115 | 100 | 1 | False |
| 7 | Squirtle | Water | 314 | 44 | 48 | 65 | 50 | 64 | 43 | 1 | False | |
| 8 | Wartortle | Water | 405 | 59 | 63 | 80 | 65 | 80 | 58 | 1 | False | |
| 9 | Blastoise | Water | 530 | 79 | 83 | 100 | 85 | 105 | 78 | 1 | False | |
| 9 | BlastoiseMega Blastoise | Water | 630 | 79 | 103 | 120 | 135 | 115 | 78 | 1 | False | |
| 10 | Caterpie | Bug | 195 | 45 | 30 | 35 | 20 | 20 | 45 | 1 | False | |
| 11 | Metapod | Bug | 205 | 50 | 20 | 55 | 25 | 25 | 30 | 1 | False | |
| 12 | Butterfree | Bug | Flying | 395 | 60 | 45 | 50 | 90 | 80 | 70 | 1 | False |
| 13 | Weedle | Bug | Poison | 195 | 40 | 35 | 30 | 20 | 20 | 50 | 1 | False |
| 14 | Kakuna | Bug | Poison | 205 | 45 | 25 | 50 | 25 | 25 | 35 | 1 | False |
| 15 | Beedrill | Bug | Poison | 395 | 65 | 90 | 40 | 45 | 80 | 75 | 1 | False |
| 15 | BeedrillMega Beedrill | Bug | Poison | 495 | 65 | 150 | 40 | 15 | 80 | 145 | 1 | False |
| 16 | Pidgey | Normal | Flying | 251 | 40 | 45 | 40 | 35 | 35 | 56 | 1 | False |
| 17 | Pidgeotto | Normal | Flying | 349 | 63 | 60 | 55 | 50 | 50 | 71 | 1 | False |
| 18 | Pidgeot | Normal | Flying | 479 | 83 | 80 | 75 | 70 | 70 | 101 | 1 | False |
| 18 | PidgeotMega Pidgeot | Normal | Flying | 579 | 83 | 80 | 80 | 135 | 80 | 121 | 1 | False |
| 19 | Rattata | Normal | 253 | 30 | 56 | 35 | 25 | 35 | 72 | 1 | False | |
| 20 | Raticate | Normal | 413 | 55 | 81 | 60 | 50 | 70 | 97 | 1 | False | |
| 21 | Spearow | Normal | Flying | 262 | 40 | 60 | 30 | 31 | 31 | 70 | 1 | False |
| 22 | Fearow | Normal | Flying | 442 | 65 | 90 | 65 | 61 | 61 | 100 | 1 | False |
| 23 | Ekans | Poison | 288 | 35 | 60 | 44 | 40 | 54 | 55 | 1 | False | |
| 24 | Arbok | Poison | 438 | 60 | 85 | 69 | 65 | 79 | 80 | 1 | False |
The clean_names() function renamed the column
labels / names of the predictors by removing any uppercase letters and
replacing the . delimiters with underscore characters to
make it “snake_case” (the default setting of
clean_names()).
———— ### Exercise 2
Using the entire data set, create a bar chart of the outcome
variable, type_1.
How many classes of the outcome are there? Are there any Pokémon types with very few Pokémon? If so, which ones?
For this assignment, we’ll handle the rarer classes by grouping them,
or “lumping them,” together into an ‘other’ category. Using the forcats
package, determine how to do this, and lump all the other
levels together except for the top 6 most frequent (which are
Bug, Fire, Grass, Normal, Water, and Psychic).
Convert type_1 and legendary to
factors.
p<- ggplot(pokemon, aes(x=type_1)) + geom_bar() +
labs(x="Primary Type") + theme_gray() + theme(axis.text.x=element_text(angle=45,hjust=1,vjust=0.5))
ggplotly(p)
# Convert type_1 and legendary to factors
pokemon <- pokemon %>%
mutate(type_1=factor(type_1)) %>%
mutate(legendary=factor(legendary)) %>%
mutate(generation=factor(generation))
pokemon$type_1 <- fct_collapse(pokemon$type_1, Other=c("Steel", "Rock", "Poison", "Ice",
"Ground", "Ghost","Flying", "Fighting",
"Fairy","Electric","Dragon","Dark"))
There are initially 18 different primary types ofPokémon — the
flying type and fairy type have significantly low numbers of Pokémon.
While this makes sense for the fairy type (since it wasn’t added until
Generation VI) it’s a bit odd that a primary type existing since
Generation I has such a low count.
————– ### Exercise 3
Perform an initial split of the data. Stratify by the outcome variable. You can choose a proportion to use. Verify that your training and test sets have the desired number of observations.
Next, use v-fold cross-validation on the training set. Use 5
folds. Stratify the folds by type_1 as well. Hint: Look
for a strata argument.
Why do you think doing stratified sampling for cross-validation is useful?
set.seed(3435)
pokemon_split <- initial_split(pokemon, prop = 0.7, strata = type_1)
pokemon_train <- training(pokemon_split)
pokemon_test <- testing(pokemon_split)
# 5-fold cross validation
pokemon_folds <- vfold_cv(pokemon_train, v = 5, strata = type_1)
# Verify that the training and testing sets have the correct number of observations
data.frame(train = c(count(pokemon_train)), test = c( count(pokemon_test) )) %>% rename( Train = n, Test = n.1)
## Train Test
## 1 559 241
Stratified sampling in cross-validation ensures that each
fold has roughly the same distribution / proportion of classes
in each fold. This ensures that models produced on each fold should
scale well to the original dataset, and will likely better generalize to
unseen data.
Create a correlation matrix of the training set, using the
corrplot package. Note: You can choose how to handle
the categorical variables for this plot; justify your
decision(s).
What relationships, if any, do you notice?
# Hot-One the legendary predictor
pokemon_train$is_legendary <- as.numeric(pokemon_train$legendary) - 1
pokemon_train %>%
select(where(is.numeric)) %>%
cor() %>%
corrplot(type = 'lower', diag = FALSE,
method = 'color')
# Get rid of 'dummy' column for one-hot encoding since we only
# use that for the corrplot in this assignment
pokemon_train <- subset(pokemon_train, select=-is_legendary)
We are able to one-hot the legendary status of a Pokémon incredibly
easily because there are only two categorical classes: True and False.
Despite it being easy to include, it is also worthwhile keeping the
legendary (actually a temporary is_legendary column
containing the one-hot-encoded values) since it is a valid question of
whether “legendary” Pokémon have better stats than their ordinary
counterparts.
However, we choose not to include the type_1 categorical
variable since we are ultimately lumping 12 potentially very different
classes into a new “Other” class, essentially diluting whatever possible
correlation information those classes could have contributed. While we
could have potentially included type_2 (since the classes
are not collapsed), it would be a bit difficult to one-hot encode 18
different classes in such a way that correlation information is readable
for each individual class.
After looking at the results of the correlation matrix, we first
address the expected relations. Possibly the most trivial is the perfect
correlation between generation and x (which is the PokéDex
entry number) — for those who have played the games, this is completely
expected since new Pokémon introduced in each generation are added at
the end of where the PokéDex left off. Thus, generation
II Pokémon’s PokéDex entries begin after the last Generation I Pokémon.
In addition, the relation between ‘total’ and each Pokémon stat is
expected, since the total is simply the sum of the stats for each
Pokémon.
Possibly the most surprising result is that there are no negative correlations between Pokémons’ stats. As a competitive game / series, Pokémon must be kept an even playing field to ensure players are able to strategically enjoy the game. What this usually means is that Pokémon that specialize in attack (e.g. most Fire-type Pokémon) have lower defense of special defense stats — however, the correlation plot above does not indicate this in any way.
Set up a recipe to predict type_1 with
legendary, generation, sp_atk,
attack, speed, defense,
hp, and sp_def.
Dummy-code legendary and
generation;
Center and scale all predictors.
pokemon_recipe <-
recipe(type_1 ~ legendary + generation + sp_atk + attack + speed + defense + hp + sp_def, data = pokemon_train) %>%
step_dummy(legendary) %>%
step_dummy(generation) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
prep(pokemon_recipe) %>% bake(new_data = pokemon_train)
We’ll be fitting and tuning an elastic net, tuning
penalty and mixture (use
multinom_reg() with the glmnet engine).
Set up this model and workflow. Create a regular grid for
penalty and mixture with 10 levels each;
mixture should range from 0 to 1. For this assignment, let
penalty range from 0.01 to 3 (this is on the
identity_trans() scale; note that you’ll need to specify
these values in base 10 otherwise).
elastic_net_model <- multinom_reg(mixture = tune(),
penalty = tune()) %>%
set_engine("glmnet")
elastic_net_wflow <- workflow() %>%
add_model(elastic_net_model) %>%
add_recipe(pokemon_recipe)
elastic_net_grid <- grid_regular(penalty(range=c(0.01, 3), trans = identity_trans()),
mixture(range(0,1)),
levels = 10)
Now set up a random forest model and workflow. Use the
ranger engine and set importance = "impurity";
we’ll be tuning mtry, trees, and
min_n. Using the documentation for
rand_forest(), explain in your own words what each of these
hyperparameters represent.
Create a regular grid with 8 levels each. You can choose plausible
ranges for each hyperparameter. Note that mtry should not
be smaller than 1 or larger than 8. Explain why neither of those
values would make sense.
What type of model does mtry = 8 represent?
random_forest_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
random_forest_wflow <- workflow() %>%
add_model(random_forest_model) %>%
add_recipe(pokemon_recipe)
random_forest_grid <- grid_regular(mtry(range = c(1, 8)),
trees(range = c(200,700)),
min_n(range = c(1,16)),
levels = 8)
Recall that the random forest algorithm utilizes a collection of
decision trees, each built upon bagging the training set. Since
individual decision trees are sensitive to changes in the training data,
taking the mean or mode of a large number of decision trees (trained on
slightly different bootstrap aggregated data) removes this sensitivity.
The trees hyperparameter is the easiest to describe in this
case: it is simply the number of decision trees that we wish to
create in our model, and thus the number of predictions that we
will either take the mean or mode over to get our final result. The
mtry and min_n, on the other hand, control
more of the behavior of the individual decision trees. The
mtry hyperparameter controls the number of predictors that
are randomly chosen to be considered for each split in a single
tree. Similarly, min_n sets the minimum number of
observations needed to split a node — since a large number of splits in
the predictor space can lead to complex trees, small values of
mtry with large values of min_n can lead to
very complicated trees that can potentially overfit the data.
Since our recipe specifies that the outcome variable
type_1 only depends on the 8 predictors
legendary, generation, sp_atk,
attack, speed, defense,
hp, and sp_def, it does not make sense to set
mtry larger than 8 since we cannot randomly sample more
predictors than we have. Similarly, it doesn’t make sense to randomly
sample NO predictors, since it then is impossible to split the parameter
space to make decision trees.
Lastly, a random forest model with mtry=8 simply means
every decision tree will be built off of splitting the entire predictor
space (i.e. all predictors are considered when building our decision
trees).
Fit all models to your folded data using
tune_grid().
Note: Tuning your random forest model will take a few minutes to run, anywhere from 5 minutes to 15 minutes and up. Consider running your models outside of the .Rmd, storing the results, and loading them in your .Rmd to minimize time to knit. We’ll go over how to do this in lecture.
Use autoplot() on the results. What do you notice? Do
larger or smaller values of penalty and
mixture produce better ROC AUC? What about values of
min_n, trees, and mtry?
What elastic net model and what random forest model perform the best on your folded data? (What specific values of the hyperparameters resulted in the optimal ROC AUC?)
elastic_net_tune <- tune_grid(
elastic_net_wflow ,
resamples = pokemon_folds,
grid = elastic_net_grid
)
save(elastic_net_tune, file = "tuned_grids/elastic_net_tune.rda")
random_forest_tune <- tune_grid(
random_forest_wflow,
resamples = pokemon_folds,
grid = random_forest_grid
)
save(random_forest_tune, file = "tuned_grids/random_forest_tune.rda")
load("tuned_grids/elastic_net_tune.rda")
load("tuned_grids/random_forest_tune.rda")
g1<-autoplot(elastic_net_tune) + ggtitle("Elastic Net") + theme_dark()
ggplotly(g1)
From the above plot, we see that larger
values of mixture tend to make our models more accurate
when penalty is considerable small (in other words, when the
penalty value is lowest at 0.01, Lasso tends to do better
than Ridge). While the accuracy plot seems to indicate that all mixtures
behave roughly the same when larger values of penalty are
considered, the roc_auc performance plot indicates that
pure Ridge regression performs much better than pure Lasso regression in
this case.
g2 <- autoplot(random_forest_tune)
pp2 <- ggplot(g2$data, aes(x = value, y = mean, group = `# Trees`, color = `# Trees`)) +
geom_path() +
geom_point(size = 1) +
facet_grid(`.metric` ~ `Minimal Node Size`, scales = "free_y", labeller = label_both) +
labs(x = paste(unique(g2$data$name)),
y = "") +
ggtitle("Random Forest") +
theme_dark() +
theme(text = element_text(size = 6))
ggplotly(pp2)
For random forest, we notice that the roc_auc quickly
improves with the number of randomly selected predictors (i.e. the
mtry hyperparameter), but begins to taper off as more that
\(\geq 5\) randomly selected predictors
are considered. There isnt much indication from the plots above that the
minimal node size min_n or the number of trees
trees has as much affect on the accuracy of the model —
however, it is worth noting that the behavior of models with a larger
number (e.g. the Pink Curve in the case of trees = 700) of
trees seems to fluctuate much less than when the number of trees is
smaller.
Ultimately, our best performing (in terms of roc_auc)
elastic net model had penalty = 0.01 and a mixture of
mixture=0.11111, meaning that it was much closer to Ridge
regression in terms of regularization. In addition, our best random
forest model had 342 trees (i.e. trees = 342), randomly
selected 3 predictors for each split (i.e. mtry = 3) and
required a minimum of 5 observations to split a node in a decision tree
(i.e. min_n = 5)
Select your optimal random forest modelin
terms of roc_auc. Then fit that model to your training set
and evaluate its performance on the testing set.
Using the training set:
Create a variable importance plot, using vip().
Note that you’ll still need to have set
importance = "impurity" when fitting the model to your
entire training set in order for this to work.
Using the testing set:
Create plots of the different ROC curves, one per level of the outcome variable;
Make a heat map of the confusion matrix.
random_forest_final_wflow <- select_best(random_forest_tune , metric="roc_auc" ) %>%
finalize_workflow(x=random_forest_wflow)
random_forest_fit <- fit(random_forest_final_wflow , pokemon_train)
random_forest_fit %>% extract_fit_parsnip() %>%
vip()
From the above variable importance plot, we see that the base stats of a Pokémon are the most important factors in determining the primary type of a pokemon, while the least important factors were the generation and legendary status — in fact, the legendary status didnt even make it onto the top 10 most important variables. This is ultimately expected, since the distribution of Pokémon types added in each generation is expected to be roughly the same.
set.seed(3435)
random_forest_test <- fit(random_forest_final_wflow, pokemon_test)
random_forest_test_res <- augment(random_forest_test, new_data = pokemon_test)
random_forest_test_res %>%
roc_curve(type_1 , c(.pred_Bug, .pred_Fire, .pred_Normal, .pred_Grass, .pred_Psychic, .pred_Water, .pred_Other) ) %>%
autoplot()
random_forest_test_res %>%
conf_mat(truth = type_1, estimate = .pred_class) %>%
autoplot(type = "heatmap")
How did your best random forest model do on the testing set?
Which Pokemon types is the model best at predicting, and which is it worst at? (Do you have any ideas why this might be?)
Ultimately, the confusion matrix indicates that our best random
forest model seems to perfectly predict the testing data (since the only
nonzero entries are on the diagonal). However, the
roc_curve plots seem to indicate that our model predicts
the Grass and Bug types much better than the remaining five levels of
our outcome — this may be due to the fact that the base statistics of
grass and bug models have very identifiable characteristics, such as
higher Speed. On the other hand, our model seemed to be worst at
predicting the Normal and Other types — this makes sense for Other since
we are combining a large number of classes into one outcome level, thus
diluting any characteristics of the original classes.
In the 2020-2021 season, Stephen Curry, an NBA basketball player, made 337 out of 801 three point shot attempts (42.1%). Use bootstrap resampling on a sequence of 337 1’s (makes) and 464 0’s (misses). For each bootstrap sample, compute and save the sample mean (e.g. bootstrap FG% for the player). Use 1000 bootstrap samples to plot a histogram of those values. Compute the 99% bootstrap confidence interval for Stephen Curry’s “true” end-of-season FG% using the quantile function in R. Print the endpoints of this interval.
# Double-Checks that randomness is fixed every time this is run
set.seed(3435)
# Create dataframe accurately representing that 337 of the shots were successful
# Using the sample() function we can ensure this distribution is (somewhat) random
curry_three <- data.frame(attempt = 1:801)
curry_three$made_attempt <- numeric(nrow(curry_three))
curry_three$made_attempt[sample(nrow(curry_three), 337)] <- 1
# Double check that we indeed have 337 makes and 464 misses
curry_three %>%
group_by(made_attempt) %>%
summarize(count=n())
## # A tibble: 2 × 2
## made_attempt count
## <dbl> <int>
## 1 0 464
## 2 1 337
# Bootstrap our data
curry_boots <- bootstraps(curry_three, times=1000)
# create a function which extracts the boostrap sample
# using the assessment() function, and then computes the mean
# of the made_attempt predictor
sample_mean_boots <- function(split){
return(mean(assessment(split)$made_attempt))
}
# Use mutate() + map() as explained in Lab 4
boot_res <- curry_boots %>%
mutate(models = map(splits,sample_mean_boots)) %>%
unnest(models)
# Plot histogram
p<-ggplot(boot_res, aes(x=models))+
geom_histogram(color="darkblue", fill="lightblue", binwidth = 0.002) + labs(x="Sample Mean")
ggplotly(p)
# Print the 99% confidence interval for end of season FG%
quantile(boot_res$models, probs = c(0.005, 0.995))
## 0.5% 99.5%
## 0.3605221 0.4817411
Using the abalone.txt data from previous assignments,
fit and tune a random forest model to predict
age. Use stratified cross-validation and select ranges for
mtry, min_n, and trees. Present
your results. What was your final chosen model’s RMSE
on your testing set?
abalone <- read.csv("data/abalone.csv")
abalone <- transform(abalone, age=rings+1.5)
abalone_split <- initial_split(abalone, prop = 0.7, strata = age)
abalone_train <- training(abalone_split)
abalone_test <- testing(abalone_split)
abalone_folds <- vfold_cv(abalone_train, v = 5, strata = age)
abalone_recipe <- recipe(age ~ type + longest_shell + diameter +
height + whole_weight + shucked_weight +
viscera_weight + shell_weight,
data = abalone_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_interact(terms= ~ starts_with("type"):shucked_weight) %>%
step_interact(terms= ~ longest_shell:diameter) %>%
step_interact(terms= ~ shucked_weight:shell_weight) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
prep(abalone_recipe) %>%
bake(new_data = abalone_train)
random_forest_model_abalone <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
random_forest_wflow_abalone <- workflow() %>%
add_model(random_forest_model_abalone) %>%
add_recipe(abalone_recipe)
random_forest_grid_abalone <- grid_regular(mtry(range = c(1, 7)),
trees(range = c(200,700)),
min_n(range = c(1,16)),
levels = 10)
Similar to Exercise 8 above, the results of our random forest
grid take a significant amount of time to compute; thus, we store our
results in an .rda file and save them for later steps.
random_forest_tune_abalone <- tune_grid(
random_forest_wflow_abalone,
resamples = abalone_folds,
grid = random_forest_grid_abalone
)
save(random_forest_tune_abalone, file = "tuned_grids/random_forest_tune_abalone.rda")
Plotting the results of our grid search below, we notice that
increasing the minimial node size min_n generally yields
much better results in terms of both our RMSE and R^2 metrics. However,
increasing the number of randomly selected predictors mtry
past mtry=3 seems to have little effect.
load("tuned_grids/random_forest_tune_abalone.rda")
g2 <- autoplot(random_forest_tune_abalone) + theme(text = element_text(size = 3))
pp2 <- ggplot(g2$data, aes(x = value, y = mean, group = `Minimal Node Size`, color = `Minimal Node Size`)) +
geom_path() +
geom_point(size = 1) +
facet_grid(`.metric` ~ `# Randomly Selected Predictors`, scales = "free_y", labeller = label_both) +
labs(x = paste(unique(g2$data$name)),
y = "") +
ggtitle("Random Forest") +
theme_dark() +
theme(text = element_text(size = 4))
ggplotly(pp2)
Ultimately, our best model (in terms of the RMSE) had 644 trees
(i.e. trees = 644), randomly selected 6 predictors for each
split (i.e. mtry = 6) and required a minimum of 16
observations to split a node in a decision tree
(i.e. min_n = 16). One way to interpret this is that a
forest made up of a large number of simpler trees performed better (on
this particular regression problem) than a small number of much more
complex trees.
random_forest_final_wflow_abalone <- select_best(random_forest_tune_abalone , metric="rmse" ) %>%
finalize_workflow(x=random_forest_wflow_abalone)
random_forest_fit_train_abalone <- fit(random_forest_final_wflow_abalone , abalone_train)
abalone_training_results <- predict(random_forest_fit_train_abalone, new_data = abalone_train %>% select(-age)) %>%
bind_cols(abalone_train %>% select(age))
random_forest_fit_test_abalone <- fit(random_forest_final_wflow_abalone , abalone_test)
abalone_testing_results <- predict(random_forest_fit_test_abalone, new_data = abalone_test %>% select(-age)) %>%
bind_cols(abalone_test %>% select(age))
tibble(dataset=c("RMSE", "R^2"),
training=c((abalone_training_results %>% rmse( age, .pred))$.estimate,
(abalone_training_results %>% rsq( age, .pred))$.estimate),
testing = c((abalone_testing_results %>% rmse( age, .pred))$.estimate,
(abalone_testing_results %>% rsq( age, .pred))$.estimate)
)
## # A tibble: 2 × 3
## dataset training testing
## <chr> <dbl> <dbl>
## 1 RMSE 1.32 1.36
## 2 R^2 0.850 0.838
Ultimately, our final model performed quite well on the testing set with a testing RMSE of 1.36 (roughly a 3% increase from the training RMSE).